Abstract

We address the problem of multiuser scheduling with partial channel information
in a multi-cell environment. The scheduling problem is formulated jointly with
the ARQ based channel learning process and the intercell interference mitigating
cell breathing protocol. The optimal joint scheduling policy under various system
constraints is established. The general problem is posed as a generalized Restless
Multiarmed Bandit process and the notion of indexability is studied. We conjecture,
with numerical support, that the multicell multiuser scheduling problem is
indexable and obtain a partial structure of the index policy.



Index Terms: Markov channel, cellular system, downlink, ARQ, multiuser scheduling,
multi-cell, greedy policy, cell breathing, dynamic program, POMDP.

1 Introduction 

Cellular wireless networks, typically characterized by a central controller (base station)
coordinating downlink and uplink transmissions to and from users within a cell, have been
a popular model among wireless network designers. A well known application deploying
the cellular network model is cellular wireless telephony [1]. In recent years, there
has been a tremendous increase in the demand for high data rates in these systems. 
The need for spectrally efficient communication strategies is thus on a steady rise. This 
is particularly serious in cellular systems that are prone to scalability issues. One such 
strategy is opportunistic multiuser scheduling, proposed by Knopp and Humblet [2] in
a single cell environment. Opportunistic multiuser scheduling can be defined as allocating 
the system resources to the user(s) experiencing the most favorable channel conditions 
at the moment. It is particularly well suited to the cellular environment thanks to the 
presence of the base station, a central coordinating authority. 

Opportunistic scheduling has since been studied extensively under various scenarios [3],
[4], [5], [6], [7]. For a general treatment of the topic with minimal physical layer assumptions,
see [8]. It is understandable that the availability of the channel state information at
the scheduler is crucial for the success of the opportunistic scheduling schemes. A vast
majority of the literature on this topic makes the unrealistic assumption of readily available
channel state information at the scheduler. In reality, however, a non-trivial amount of
resource must be spent in gathering the information on the channel state. This leads us
to the following critical question: Are there efficient joint channel acquisition - multiuser
scheduling schemes for a cellular system? For single cell systems, this issue has recently
been addressed in independent works [9] and [10] in the contexts of multiuser downlink
and cognitive radio, respectively. These works exploit the memory inherent in the fading
channels along with the ARQ feedback used for error control purposes for opportunistic
channel aware scheduling. They model the fading channel with memory by a Gilbert-Elliott
(GE) channel ([11], [12], [13], [14], [15], [16]) and identify the effect of channel scheduling
decisions on the channel information acquisition process and vice versa. By formulating 
the joint optimization problem as a dynamic program, they show that a greedy policy that 
maximizes the immediate reward (reward defined to reflect data rate in [9]) is optimal. 
The policy is also shown to be remarkably simple to implement. Notice that there is no 
additional overhead incurred by the channel acquisition process. These works underline 
the potential for tapping into existing resources in the system (in this case, the channel 
memory and the ARQ mechanism already in place for error control) for low overhead 
channel aware scheduling. 

Rarely in realistic scenarios do we encounter a single cell cellular system. The very 
idea of cells was conceived with an intention to control wireless communication between 
users spread over a large geographic area by splitting it into small manageable cells. Thus 
transmission in a cell interferes with the transmissions in the adjacent cells. It follows 
that the channel state of any user in a cell is a function of the transmissions and schedule 
decisions in the adjacent cells, effectively imparting a convolved dependence between the 
scheduling choices in neighboring cells. We now face the following question: How does 
the easily implementable, low overhead, ARQ based joint channel acquisition - multiuser 
scheduling scheme extend to the multi-cell environment?

We address this problem by following a two layered approach: A well established 
inter-cell interference (ICI) control mechanism is adopted and assumed to be in place. 
On top of this layer we optimize the ARQ based multiuser scheduling scheme across 
the cells. We now proceed to introduce our choice of the ICI mechanism after a short 
literature survey on the topic. 

Traditionally, ICI is controlled by staggering the transmissions in adjacent cells across 
orthogonal frequency bands and reusing these bands in geographically far-apart cells. 
This is the well known frequency reuse based ICI control mechanism [1] that is prevalent
in narrow band systems, such as GSM. Other ICI control mechanisms have also been
studied. In [17], a capture division packet access (CDPA) mechanism is proposed. Here,
users are allowed to transmit on the same carrier in adjacent cells, i.e., no frequency 
reuse based ICI control is deployed. The effect of interference is quantified by a capture 
probability defined as the likelihood of successful transmission under ICI. Upon colli- 
sion, a retransmission is performed. The authors demonstrate that CDPA outperforms 
traditional TDMA based strategies under certain operating conditions. 

Notice that, in the preceding scenario, users at the periphery of the cell suffer from 
low signal to interference ratio and hence low capture probability compared to the users 
near the base station. This is the classic near-far effect. If this is not addressed properly,
under QoS requirements on fairness across users, the far users will act as a bottleneck, thus
bringing down the overall system utility. Taking note of this crucial phenomenon, the
authors in [18] propose a novel reduced power channel reuse (RPCR) scheme that aims
to equalize the capture probabilities of the near and far users. With the users formally
classified into two groups, near and far (based on a generic "distance" metric that need not
be a function of the geometric distance), RPCR works as follows: If, in a carrier, a far
user is scheduled in a cell, the power of transmission in the same carrier in the adjacent
cell is deliberately reduced. This power reduction naturally limits transmission to near
users in the adjacent cell. Thus, at any time, in any carrier, cell 1 and cell 2 transmit to
users belonging to complementary groups with different power levels (full power for the far 
user). With the availability of multiple carriers, in a cell, carriers are classified as primary 
and secondary with far users assigned to the primary carriers (at full power level) and 
near users to the secondary carriers (at reduced power levels). This primary/secondary
classification is reversed in the adjacent cell. With this arrangement, the two cells need 
not coordinate their transmissions to maintain the main idea of the RPCR. The authors 
formulate and study the optimal channel selection policy that assigns users to the near/far 
groups. They show that the RPCR scheme is superior in performance to other ICI control 
mechanisms in terms of sum throughput under uniform fairness constraints. 

In [19], the authors take a fundamental, information theoretic approach towards ICI 
control. They show that varying power across carriers in a complementary fashion across 
adjacent cells with far users assigned to carriers with higher power and vice versa im- 
proves the overall capacity region and hence the spectral efficiency of a two-cell system. 
For reasonable fairness constraints, the spectral efficiency was shown to be better in 
comparison to the traditional frequency reuse based ICI control. This is consistent with 
the observation made in [18] with respect to the RPCR scheme. In [19], the authors 
also consider a multi-cell system with wideband mobiles. Here the mobile carriers are 
assumed to communicate over the entire available spectrum, a possibility in wideband 
systems. Thus, with only one carrier available, the luxury of varying power across car- 
riers is absent. For this scenario, the authors propose a cell-breathing scheme where, in 
a cell, the power allocated to the carrier varies rhythmically across time, in a fashion 
complementary to the adjacent cells. Thus, at no time, the adjacent cells transmit at the 
same power (and hence to the same group users). This rhythmic power variation across 
time resembles a breathing pattern, hence the name cell breathing. Notice that the main 
idea of equalizing the capture probabilities of the near and far users and hence the utility 
gains associated with the approach are retained in the cell breathing technique. This 
is demonstrated in [19]. The cell-breathing protocol for wideband systems was studied
further in [20]. Here the authors obtain achievable rate regions associated with the cell 
breathing strategy when it is deployed alone and in combination with several precoding 
schemes. 

Encouraged by the positive results associated with the RPCR and the cell breathing 
techniques, we adopt cell breathing as our ICI control mechanism. Note that we are 
assuming a wideband system with only one carrier available. We will later show that,
with regard to our analysis in this work, this is a more general case than the system with
multiple carriers. If the channels of the users are time-variant, it is readily seen that,
without violating the cell breathing protocol, the performance of the system can be
improved by (joint) opportunistic multiuser scheduling with coordination across cells.
We address this joint opportunistic scheduling problem in a two-cell system$^1$ in this work.
By demonstrating that the channel can still be modeled by i.i.d. two-state Markov chains,
as in the single cell case, we study the ARQ based joint opportunistic scheduling scheme
in the two-cell system under various scenarios: (a) when the cooperation between the
cells is not symmetric; (b) when there are restrictions on the breathing pattern, where the
breathing pattern refers to the time-sequence of near-far-near... users scheduled across
time (a simple illustration of the breathing pattern with and without opportunistic
scheduling is available in Fig. 1); (c) when there is complete cooperation with no
restriction on the breathing pattern. In the last scenario, a direct analysis of the problem
appears intractable. We therefore establish a link between our problem and the Restless
Multiarmed Bandit (RMAB) processes. We introduce the notion of indexability from the
RMAB theory and perform an indexability analysis for the current system. We claim,
with numerical support, that the scheduling problem at hand is in fact indexable and
partially characterize the link between the index based policy and the greedy policy.

Figure 1: Breathing Pattern Illustration: (a) The breathing is rhythmic without joint
opportunistic scheduling. (b) The breathing pattern is influenced by joint opportunistic
scheduling.

The report is organized as follows. The problem setup is described in Section 2,
followed by a study of the optimal ARQ based scheduling policy under different system
requirements in Section 3. We establish the connection between the scheduling problem
and RMAB processes along with an overview of the notion of indexability in Section 4.
We perform an indexability analysis of the scheduling problem in Section 5. In Section 6,
we partially characterize the link between the index policy and the greedy policy.

1 Extension to multi-cell system is discussed in the next section.

2 Problem Setup



2.1 Channel Model 

Consider a two-cell system. Consistent with [18], [19], within each cell, we cluster users
into near and far users. We use geometric distance between the users and their respective
base stations as the metric for this classification. Denote by $n_i$, $f_i$ the sets of near and
far users, respectively, in cell $i \in \{1,2\}$. A user in a group is denoted by the label of
the group for notational simplicity. Let the distance between base station $i$ and user $j$
(in any cell) be given by $d_{ij}$. By way of the two level clustering we assume, $d_{i f_i}$ is the
same for all far users in cell $i$. Likewise, $d_{i n_i}$ is uniform for near users in cell $i$. Let $N$
be the number of near users in the cells and $F$ the number of far users. Denote the
normalized (with respect to attenuation loss) fading coefficient of the link between base
station $i$ and user $j$ (in any cell) as $h_{ij}$. We assume the $h_{ij}$ are i.i.d. Consider
cell 1 as the primary cell and cell 2 the interfering cell. If a far user $f$ is$^2$ served in the
primary cell with power $P_f$ and if the interfering base station is transmitting at power
$P_{I_f}$ ($I_f$ indicates interference to the far user in the primary cell), then the SINR at this
user is given as below.



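\[
\mathrm{SINR}_f = \frac{\frac{P_f}{d_{1f}^{\alpha}} |h_{1f}|^2}{N_0 + \frac{P_{I_f}}{d_{2f}^{\alpha}} |h_{2f}|^2} \tag{1}
\]

where $N_0$ indicates the variance of the additive noise. We have used the attenuation
model from [21] with $\alpha > 2$ being the attenuation coefficient. Likewise, if a near user is
served in the primary cell with the interfering base station power being $P_{I_n}$, the SINR is
given by

\[
\mathrm{SINR}_n = \frac{\frac{P_n}{d_{1n}^{\alpha}} |h_{1n}|^2}{N_0 + \frac{P_{I_n}}{d_{2n}^{\alpha}} |h_{2n}|^2}. \tag{2}
\]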

An illustration of these two scenarios is provided in Fig. 2.

Consistent with the two level clustering we assume, the base stations are allowed
to choose one of two power levels, i.e., $P_f, P_n, P_{I_f}, P_{I_n} \in \{P_1, P_2\}$ with $P_2 < P_1$. By
observation, since $d_{1f} > d_{1n}$, the average SINR of the far and near users can be equalized
if $P_{I_f} < P_{I_n}$ and $P_f > P_n$. This, along with the constraint on the alphabet size of the
power levels, leads to the cell breathing rule [18], [19], [20]: A far user is served with power
$P_1$ and a near user with power $P_2 < P_1$. Whenever a far user is scheduled in a cell, a
near user is scheduled in the adjacent cell and vice versa.
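The effect of the cell breathing rule on the SINR expressions (1) and (2) can be illustrated numerically. The following is a minimal sketch, not part of the original analysis: the function name `sinr`, the power levels, the distances, the unit fading gains and the attenuation coefficient are all hypothetical values chosen only for illustration.

```python
def sinr(p_tx, d_tx, h_tx, p_int, d_int, h_int, noise=1.0, alpha=3.0):
    """SINR in the form of equations (1)-(2): path loss d**(-alpha), fading gain |h|**2."""
    signal = p_tx / d_tx**alpha * abs(h_tx) ** 2
    interference = p_int / d_int**alpha * abs(h_int) ** 2
    return signal / (noise + interference)

# Hypothetical numbers (assumptions, not taken from the paper).
P1, P2 = 4.0, 1.0                  # full and reduced power, P2 < P1
d_near, d_far = 1.0, 2.0           # distance to the serving base station
d_int_near, d_int_far = 3.0, 2.0   # distance to the interfering base station

# Cell breathing: a far user is served at P1 while the neighbor transmits at P2,
# and a near user is served at P2 while the neighbor transmits at P1.
far_breathing = sinr(P1, d_far, 1.0, P2, d_int_far, 1.0)
near_breathing = sinr(P2, d_near, 1.0, P1, d_int_near, 1.0)

# Reference case: both cells always transmit at full power.
far_full = sinr(P1, d_far, 1.0, P1, d_int_far, 1.0)
near_full = sinr(P1, d_near, 1.0, P1, d_int_near, 1.0)

print(far_breathing, near_breathing, far_full, near_full)
```

With these assumed numbers, the far user's SINR improves under breathing and the gap between near and far users narrows, which is the qualitative behavior the rule is designed to achieve. Exact equalization depends on the actual geometry and power levels.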

Since the links between the base stations and users, $h_{ij}$, are i.i.d., with the SINR values
equalized under cell breathing, we have the following: $\mathrm{SINR}_{f_1}$, $\mathrm{SINR}_{n_1}$, $\mathrm{SINR}_{f_2}$, $\mathrm{SINR}_{n_2}$
are i.i.d. The fading coefficients were assumed to evolve with memory in [9], leading to the
GE model. Retaining this assumption on the fading coefficients and assuming a threshold$^3$
based decoding rule as below: $\mathrm{SINR} > \gamma \Rightarrow$ decoding success and $\mathrm{SINR} < \gamma \Rightarrow$ decoding
failure, we retain the GE model for the channels of the users.

Figure 2: Illustration showing transmissions and interference caused when a far user and
a near user are served (at different times) in cell 1, with cell 2 as the interfering cell.

2 We have dropped the suffix 1 as the context is clear.

3 The value of the threshold is a function of the application and the decoder in use.

Specifically, the channel
of each user is modeled by an i.i.d two-state Markov chain, with the ON state allowing 
the successful transmission of a fixed length packet. The channel of each user remains 
fixed through a time slot and evolves into another state in the next slot according to 
the Markov chain statistics. The time slots of all users are synchronized. The two-state 
Markov channel is characterized by a 2 x 2 probability transition matrix 



\[
P = \begin{bmatrix} p & q \\ r & s \end{bmatrix} \tag{3}
\]



where 



• $p$ := prob(channel is ON in the current slot | channel was ON in the previous slot)

• $q := 1 - p$

• $r$ := prob(channel is ON in the current slot | channel was OFF in the previous slot)

• $s := 1 - r$.

Throughout this work, we consider positively correlated (in time) channels, i.e., p > r. 
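As an illustration of the two-state channel model in (3), the following minimal sketch simulates one user's channel and compares the empirical ON fraction with the stationary ON probability $r/(1 - p + r)$, which follows from the transition matrix. The function names, the values of $p$ and $r$, and the choice to draw the initial state from the stationary distribution are assumptions made for this example.

```python
import random

def stationary_on(p, r):
    """Stationary probability of the ON state for the chain in (3)."""
    return r / (1.0 - p + r)

def simulate_channel(p, r, num_slots, rng):
    """Return a list of 0/1 channel states (1 = ON), one per time slot."""
    state = 1 if rng.random() < stationary_on(p, r) else 0
    states = []
    for _ in range(num_slots):
        states.append(state)
        on_prob = p if state == 1 else r  # ON probability for the next slot
        state = 1 if rng.random() < on_prob else 0
    return states

p, r = 0.8, 0.2  # hypothetical, positively correlated (p > r)
states = simulate_channel(p, r, 20000, random.Random(0))
empirical_on = sum(states) / len(states)
print(stationary_on(p, r), empirical_on)
```

Because $p > r$, ON and OFF states tend to appear in runs, which is the channel memory that the ARQ based scheduler exploits.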

It is worth noting that the scheduling analysis for the two-cell system readily extends 
to a multi-cell configuration with the use of six-directional antennae ([1]) at the base
stations. Each cell can now be divided into six regions and the joint scheduling analysis
in this work can be applied in each region independently. This is illustrated in Fig. 3.



2.2 Scheduling Problem 

Figure 3: Multi-cell extension using six-directional antennae at the base stations: each
cell can be split into six regions and the two-cell joint scheduling can be performed on
these regions independently. One such region is highlighted.

The base stations are the central controllers that control the transmission to the users
in each control interval within their respective cells. In each control interval, the base
stations do not know the exact channel state of the users and must schedule the
transmission of the head of line packet of exactly one user each (a data queue is maintained
for each user to collect the data meant for that user), while maintaining the cell breathing
protocol. In any cell, if a far user is scheduled, transmission takes place at full power $P_1$.
For a near user the lower power $P_2 < P_1$ is used, in accordance with cell breathing. A
traditional ARQ based transmission is deployed in each cell. Here, in each cell, at the 
beginning of a time slot, the head of line packet of the scheduled user is transmitted. If 
the packet does not go through, i.e., it cannot be decoded by the user (when the channel 
is in OFF state), a NACK is sent back from the user at the end of the slot, and the 
packet is retained at the head of the queue for retransmission at a later opportunity. If 
the packet goes through (ON state), an ACK is sent back and the packet is removed from 
the queue. The ARQ feedback is assumed to be transmitted over a dedicated error free 
channel. At the end of the slot, the base stations of the neighboring cells share this ARQ 
information. Thus each base station has all the channel information that is available to
its neighbor. This is crucial in the cell breathing based joint scheduling. Thus, effectively, 
we may consider the base station pair as a single control entity when it comes to joint 
scheduling based on ARQ feedbacks. The performance metric that the base stations aim 
to maximize is the sum throughput of the system. Details are discussed next. 

2.3 Formal Problem Definition 

We now introduce the terms/entities that we use in this work, many of which are borrowed 
from the POMDP [22] literature. 

Control interval $k$: Each time slot in our problem setup will henceforth be called a
control interval. The scheduling process horizon is fixed. A control interval is indexed
by $k$ if there are $k - 1$ more intervals until the horizon.

Action $(a_k^1, a_k^2)$: Indicates the indices of the user pair scheduled in cells 1 and 2
in control interval $k$. With cell breathing in place, we have the following constraint:
$(a_k^1, a_k^2) \in \{(n_1, f_2), (f_1, n_2)\}$. We denote this admissible set by $B$.

Belief values at the $k$th control interval: Denote by $\pi_k^c$ the vector of the belief values
(the probability of having an ON state) of the users in cell $c$ at time $k$. Let $F_k^c$ indicate
the ARQ feedback received at the end of control interval $k$ in cell $c$. We denote an ACK
by 1 and a NACK by 0. The belief values in cell 1 evolve as below:

\[
\pi_{k-1}^1(i) =
\begin{cases}
p, & \text{if } i = a_k^1,\ F_k^1 = 1 \\
r, & \text{if } i = a_k^1,\ F_k^1 = 0 \\
\pi_k^1(i)\, p + (1 - \pi_k^1(i))\, r, & \text{if } i \neq a_k^1
\end{cases} \tag{4}
\]

where the first case indicates that, in cell 1, user $i$ is scheduled in control interval $k$ and an
ACK feedback was received. Thus, according to the Markov chain statistics, $\pi_{k-1}^1(i) = p$.
The second case is explained similarly when a NACK feedback is received. The last case
indicates that user $i$ was not scheduled for transmission in control interval $k$ and hence
the cell 1 base station must estimate the belief value at the current control interval from 
that at the previous control interval and the Markov chain statistics. Similar evolution 
holds for cell 2. It is a well known fact ([22]) that the belief values are sufficient statistics 
to any information about the channels in the past control intervals, in our case, the 
scheduling decisions and the ARQ feedbacks from the past. Thus the joint scheduling 
decision in any control interval can be solely based on the belief values for that interval 
and not on the past ARQ or schedule information. 
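The belief evolution described above can be sketched in a few lines. This is an illustrative implementation only: the function names and the list-based representation of the belief vector are assumptions, and the example values of $p$ and $r$ are hypothetical.

```python
def propagate(belief, p, r):
    """One-step belief propagation for an unobserved user: belief*p + (1 - belief)*r."""
    return belief * p + (1.0 - belief) * r

def update_beliefs(beliefs, scheduled, ack, p, r):
    """Belief vector for the next control interval in one cell.

    beliefs:   list of ON probabilities, one per user
    scheduled: index of the user scheduled in this interval
    ack:       True for ACK (channel was ON), False for NACK (channel was OFF)
    """
    updated = []
    for i, belief in enumerate(beliefs):
        if i == scheduled:
            updated.append(p if ack else r)  # state was observed through ARQ
        else:
            updated.append(propagate(belief, p, r))
    return updated

# Hypothetical example: user 0 is scheduled and an ACK is received.
print(update_beliefs([0.5, 0.9], scheduled=0, ack=True, p=0.8, r=0.2))
```

Only the current belief vector and the latest feedback enter the update, which reflects the sufficient-statistic property noted above.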

Scheduling Policy $\mathfrak{A}_k$: A scheduling policy in the control interval $k$ is a mapping
from the belief values and the control interval index to an action as follows:

\[
\mathfrak{A}_k : (\{\pi_k^1, \pi_k^2\}, k) \to (a_k^1, a_k^2) \in B.
\]

Reward Structure: In any control interval $k$, in cell $c$, a reward of 1 is accrued when the
transmission in cell $c$ is successful, i.e., when $F_k^c = 1$, and no reward is accrued otherwise.
The total reward in control interval $k$ is simply the sum of the rewards accrued by cells 1
and 2. Note that this reward structure is defined to be consistent with our performance
metric, the sum throughput (to be discussed shortly).

Net Expected Reward in the control interval $m$, $V_m$: With the belief vectors, $\pi_m^1$, $\pi_m^2$,
and the scheduling policy, $\{\mathfrak{A}_k\}_{k \le m}$, fixed, the net expected reward, $V_m$, is the sum of
the reward, $R_m(\pi_m^1, \pi_m^2, a_m^1, a_m^2)$, expected in the current control interval $m$ and $E[V_{m-1}]$,
the net reward expected in the future control intervals conditioned on the belief vectors
and the scheduling decisions in the current control interval. Formally,

\[
V_m(\pi_m^1, \pi_m^2, \{\mathfrak{A}_k\}_{k \le m})
= R_m(\pi_m^1, \pi_m^2, a_m^1, a_m^2) + E\big[V_{m-1}(\pi_{m-1}^1, \pi_{m-1}^2, \{\mathfrak{A}_k\}_{k \le m-1}) \,\big|\, \pi_m^1, \pi_m^2, a_m^1, a_m^2\big], \tag{5}
\]

where the expectation is over the belief vectors $\pi_{m-1}^1$, $\pi_{m-1}^2$. Since the reward in each
control interval in each cell is either 1 or 0, the expected current reward can be written
as

\[
R_m(\pi_m, a_m) = \pi_m^1(a_m^1) + \pi_m^2(a_m^2).
\]
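To make the reward expression concrete, the following sketch picks the action in the admissible set $B$ that maximizes the expected immediate reward, i.e., one reading of a greedy rule applied jointly across the two cells. The function name, the dictionary-based belief representation and the user labels are hypothetical and introduced only for illustration.

```python
def greedy_joint_action(near1, far1, near2, far2):
    """Pick the cell-breathing-compliant user pair with the largest expected reward.

    Each argument maps a user label to its belief value (probability of ON).
    Returns ((user_cell1, user_cell2), expected_immediate_reward).
    """
    def best(group):
        user = max(group, key=group.get)
        return user, group[user]

    n1, f1, n2, f2 = best(near1), best(far1), best(near2), best(far2)
    near_far = ((n1[0], f2[0]), n1[1] + f2[1])  # near in cell 1, far in cell 2
    far_near = ((f1[0], n2[0]), f1[1] + n2[1])  # far in cell 1, near in cell 2
    return max(near_far, far_near, key=lambda option: option[1])

# Hypothetical belief values.
action, reward = greedy_joint_action(
    near1={"n1_a": 0.7, "n1_b": 0.4},
    far1={"f1_a": 0.6},
    near2={"n2_a": 0.5},
    far2={"f2_a": 0.3, "f2_b": 0.55},
)
print(action, reward)
```

The expected reward of a pair is simply the sum of the two belief values, so only the best near and best far user in each cell need to be compared.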

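Performance Metric, the Sum Throughput, $\eta_{\mathrm{sum}}$: For a given scheduling policy $\{\mathfrak{A}_k\}_{k \ge 1}$
and initial belief vectors $\pi_m^1$, $\pi_m^2$, the sum throughput is given by

\[
\eta_{\mathrm{sum}}(\{\mathfrak{A}_k\}_{k \ge 1}) = \lim_{m \to \infty} \frac{V_m(\pi_m^1, \pi_m^2, \{\mathfrak{A}_k\}_{k \le m})}{m}. \tag{6}
\]

Optimal Scheduling Policy, $\{\mathfrak{A}_k^*\}_{k \ge 1}$:

\[
\{\mathfrak{A}_k^*\}_{k \ge 1} := \arg\max_{\{\mathfrak{A}_k\}_{k \ge 1}} \eta_{\mathrm{sum}}(\{\mathfrak{A}_k\}_{k \ge 1}). \tag{7}
\]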

Before we analyze scheduling in the original system, we consider a few variants of the 
system in the next section. 



3 Optimal Policy for Variants of the System 

3.1 When the Cooperation between the Cells is Asymmetric 

Consider a system where the cell breathing is deployed by the following asymmetric 
cooperation between the cells: base station 1 schedules transmission to its users without 
any regard to the decisions in cell 2, while base station 2 chooses the group of the user 
to be scheduled based on the user group choice of base station 1. Base station 1 is aware 
of this compromise made by base station 2 and therefore adopts the two state Markov 
model for the channels of its users, which is valid only under cell breathing. Such an 
asymmetric cooperation can arise in scenarios such as the following: (1) Cell 1 may cover the heart
of a city with higher data rate requirements compared to cell 2 that covers the suburbs. 
(2) The sharing of ARQ feedback information between the adjacent base stations may 
not be mutual due to a link failure between the base stations. (3) In the context of game 
theory, when base station 1 is a selfish player and base station 2 is a rule-abiding player. 

Consider cell 1 under the asymmetric cooperation scenario. Since base station 1 does 
not make any effort in maintaining cell breathing, the multiuser scheduling problem in 
cell 1 becomes the same as the multiuser scheduling in an isolated cell. For an isolated 
Markov modeled downlink with N users, the greedy policy that maximizes the immediate 
reward is shown to be optimal in [9] (for N < 3 users) and [10] (for any N). Thus base
station 1, under asymmetric cooperation, implements the greedy policy within its cell. 

We now proceed to study the optimal scheduling policy in cell 2. Fix a realization
of the channel states of the users in cell 1 from time $t > 0$ until the horizon. With a
fixed scheduling policy in cell 1 (in this case, the greedy policy), we can define a sequence
of time instants $\{t_n, t_{n-1}, \ldots, t_1\}$ with $t \ge t_n > t_{n-1} > \ldots > t_1 \ge 1$, where a near user is
scheduled in cell 1. Note that $n$ is the number of control intervals from $t$ when a near
user is scheduled. Thus at control intervals $\{t, t-1, \ldots, 1\} - \{t_n, t_{n-1}, \ldots, t_1\}$ a far user is
scheduled in cell 1. Define $\mathbf{t}_k := \{t_k, t_{k-1}, \ldots, t_1\}$ with $k \le n$. Note that, on the sporadic
time axis $\mathbf{t}_n$ ($\{t, t-1, \ldots, 1\} - \mathbf{t}_n$), base station 2, in order to maintain cell breathing,
schedules far (near) users.
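The construction of the two sporadic time axes can be sketched as follows. The function name and the dictionary representation of the cell 1 schedule are hypothetical; control intervals are indexed by the number of intervals remaining, as in the text.

```python
def sporadic_axes(cell1_groups):
    """Split the control intervals into the two sporadic time axes seen by cell 2.

    cell1_groups maps a control interval index to "near" or "far", the group
    of the user scheduled in cell 1 during that interval.
    Returns (far_axis, near_axis): the intervals in which base station 2 must
    schedule a far user and a near user, respectively, in decreasing order.
    """
    far_axis = sorted((k for k, g in cell1_groups.items() if g == "near"), reverse=True)
    near_axis = sorted((k for k, g in cell1_groups.items() if g == "far"), reverse=True)
    return far_axis, near_axis

# Hypothetical realization of the cell 1 schedule over t = 5 control intervals.
schedule = {5: "near", 4: "far", 3: "far", 2: "near", 1: "far"}
print(sporadic_axes(schedule))
```

The gaps between consecutive entries of each axis are generally non-uniform, which is why belief values on a sporadic axis evolve over several Markov steps between scheduling opportunities.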

Let $\mathfrak{A}^f$ and $\mathfrak{A}^n$ denote the scheduling policies adopted by base station 2 on the sporadic
time axes corresponding, respectively, to near and far schedule decisions in cell 1. Let
$\{\hat{\mathfrak{A}}, \mathfrak{A}^f, \mathfrak{A}^n\}$ denote the two-cell scheduling policy with $\hat{\mathfrak{A}}$ indicating the use of the greedy
policy in cell 1. We now introduce the following lemma.

Lemma 1. Under the asymmetric cooperation assumption, if, for any fixed $t$, for every
realization of $\{n, \mathbf{t}_n\}$, the scheduling policy $\{\mathfrak{A}^f, \mathfrak{A}^n\}$ is throughput maximizing in cell 2,
then $\{\hat{\mathfrak{A}}, \mathfrak{A}^f, \mathfrak{A}^n\}$ is sum throughput optimal.

Proof. The lemma is not obvious due to a possible influence of the policies $\mathfrak{A}^f$ and $\mathfrak{A}^n$ on
the sporadic time axis $\mathbf{t}_n$, potentially invalidating the realization based argument. Under
the asymmetric cooperation assumption, since base station 1 makes near/far schedule
decisions without consulting base station 2 and since the channel states evolve independently
at the underlying physical layer$^4$, $\{n, \mathbf{t}_n\}$ is independent of the schedule decisions
and observations made in cell 2 within the sporadic time axes and hence is independent
of $\mathfrak{A}^f$ and $\mathfrak{A}^n$. This decoupling along with the fact that the greedy policy is optimal in
cell 1 establishes the lemma. □

We now proceed to show that the greedy policy is optimal within a realization of 
the sporadic time axes. Note that the greedy policy was shown to be optimal on a 
non-sporadic time axis in [9], [10]. However, in the current case, since the belief values
evolve across non-uniform time steps, we need a rigorous optimality proof in the changed 
setting. 
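For intuition on what the optimality of the greedy policy means on a regular (non-sporadic) time axis, the following sketch compares, by brute-force dynamic programming, the net expected reward of the greedy policy with that of the best policy in a small isolated cell. It is an illustrative check on one hypothetical instance (two users, six control intervals, assumed values of $p$ and $r$), not a proof and not the authors' method.

```python
P_ON, R_ON = 0.8, 0.3  # hypothetical transition probabilities with p > r

def next_beliefs(beliefs, scheduled, ack):
    """Belief vector after scheduling one user and observing ACK/NACK."""
    return tuple(
        (P_ON if ack else R_ON) if i == scheduled else b * P_ON + (1 - b) * R_ON
        for i, b in enumerate(beliefs)
    )

def net_reward(beliefs, k, greedy):
    """Net expected reward with k control intervals left until the horizon."""
    if k == 0:
        return 0.0
    users = range(len(beliefs))
    # Greedy considers only the user with the largest belief; otherwise try all.
    candidates = [max(users, key=lambda i: beliefs[i])] if greedy else users
    best = 0.0
    for a in candidates:
        b = beliefs[a]
        value = (b * (1.0 + net_reward(next_beliefs(beliefs, a, True), k - 1, greedy))
                 + (1.0 - b) * net_reward(next_beliefs(beliefs, a, False), k - 1, greedy))
        best = max(best, value)
    return best

start = (0.45, 0.6)  # hypothetical initial beliefs
optimal = net_reward(start, 6, greedy=False)
greedy_value = net_reward(start, 6, greedy=True)
print(optimal, greedy_value)
```

For this instance the two values are expected to agree up to floating-point error, consistent with the optimality results cited from [9] and [10]. The sporadic case differs in that unobserved beliefs are propagated over a variable number of steps between decisions.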

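Fix a realization $\{n, \mathbf{t}_n\}$ throughout the following analysis. The net expected reward
accrued by base station 2 from $t_k$, $k \le n$, on the time axis $\mathbf{t}_n$ is given as follows.

\[
V_{t_k}(\pi_{t_k}, \{a_{t_k}, \{\mathfrak{A}_{t_l}\}_{k > l > 0}\}) = \pi_{t_k}(a_{t_k}) + E\big[V_{t_{k-1}}(\pi_{t_{k-1}}, \{\mathfrak{A}_{t_l}\}_{k-1 \ge l > 0}) \,\big|\, \pi_{t_k}, a_{t_k}\big] \tag{8}
\]

where $\pi_{t_k}$ is the belief vector of the channel states of the far users in cell 2 in control interval $t_k$,
$a_{t_k}$ is the far user scheduled in control interval $t_k$ and $\{\mathfrak{A}_{t_l}\}_{k > l > 0}$ is the scheduling policy
from the next instant till the horizon on $\mathbf{t}_n$. We now establish the structure of the greedy
policy on the sporadic time axis. In any control interval $t_k$, $k < n$, the belief values of
the users are given as follows.

\[
\pi_{t_k}(i) =
\begin{cases}
p, & \text{if } i = a_{t_{k+1}},\ F_{t_{k+1}} = 1 \\
r, & \text{if } i = a_{t_{k+1}},\ F_{t_{k+1}} = 0 \\
T^{(t_{k+1} - t_k)}(\pi_{t_{k+1}}(i)), & \text{if } i \neq a_{t_{k+1}}
\end{cases} \tag{9}
\]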

where $F_t$ is the ARQ feedback received by base station 2 from the scheduled user in
control interval $t$ with 1 (0) corresponding to an ACK (NACK). Note that for $0 \le x \le 1$,
$T(x) = x(p - r) + r$. Thus $T(x) \in [r, p]$. Since $T^l(x) = T(T^{l-1}(x))$, by induction,
$T^l(x) \in [r, p]$ when $l > 0$. Also $T(x_1) > T(x_2)$ if $x_1 > x_2$. Thus, by induction,
$T^l(x_1) > T^l(x_2)$ if $x_1 > x_2$. We now introduce the schedule order vector $O_{t_k}$ as the
ordered arrangement of the index of the users in decreasing order of $\pi_{t_k}(i)$, i.e.,

\[
\begin{aligned}
O_{t_k}(1) &= \arg\max_i \pi_{t_k}(i) \\
&\ \ \vdots \\
O_{t_k}(N_f) &= \arg\min_i \pi_{t_k}(i),
\end{aligned}
\]

where $N_f$ is the number of far users in cell 2. From the preceding discussion on the
structure of $T(x)$ and the evolution of the belief values, the schedule order vector evolves
as below:

\[
O_{t_{k-1}} =
\begin{cases}
[a_{t_k}\ \ \{O_{t_k} - a_{t_k}\}], & \text{if } F_{t_k} = 1 \\
[\{O_{t_k} - a_{t_k}\}\ \ a_{t_k}], & \text{if } F_{t_k} = 0.
\end{cases} \tag{10}
\]

4 Note that the inter-cell, intra-cell base station to user links are assumed to be statistically
independent.
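The properties of $T^l(x)$ and the order vector evolution in (10) can be checked with a short sketch. The closed form used below, $T^l(x) = x^* + (p - r)^l (x - x^*)$ with fixed point $x^* = r/(1 - p + r)$, follows from the linearity of $T$ and is stated here as a convenience, not quoted from the paper. The function names and numerical values are hypothetical.

```python
def t_iter(x, p, r, steps):
    """Apply T(x) = x*(p - r) + r repeatedly, `steps` times."""
    for _ in range(steps):
        x = x * (p - r) + r
    return x

def t_closed(x, p, r, steps):
    """Closed form of the same iteration, using the fixed point r / (1 - p + r)."""
    fixed_point = r / (1.0 - p + r)
    return fixed_point + (p - r) ** steps * (x - fixed_point)

def update_order(order, scheduled, ack):
    """Order vector update of (10): ACK moves the scheduled user to the top, NACK to the bottom."""
    rest = [user for user in order if user != scheduled]
    return [scheduled] + rest if ack else rest + [scheduled]

p, r = 0.8, 0.2  # hypothetical, p > r
print(t_iter(0.9, p, r, 3), t_closed(0.9, p, r, 3))

# Greedy schedules the user on top of the order vector; a NACK triggers a switch.
order = [1, 2, 3]
order = update_order(order, scheduled=order[0], ack=False)
print(order)
```

Since the relative order of unobserved users is preserved by $T^l$ regardless of the number of steps $l$, the same round-robin update applies even when the gaps on the sporadic axis are non-uniform.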

The greedy policy, which aims to maximize the immediate reward (belief value), picks
the user on the top of the schedule order vector and thus has a round-robin structure,
with user switch triggered by a NACK, on the sporadic time axis, as observed on the
non-sporadic axis in [9], [10]. We now proceed to show the optimality of the greedy policy
on $\mathbf{t}_n$ by first deriving a sufficient condition for optimality, similar to our analysis in [9].

Consider a control interval $t_m$, $m \le n$, with belief vector $\pi_{t_m}$ and action $a_{t_m}$. Let
the users be indexed in the order of their belief values in control interval $t_m$, i.e., $O_{t_m} =
[1 \ldots N_f]$. Assume $\{\mathfrak{A}_{t_k}\}_{k \le m-1} = \{\hat{\mathfrak{A}}_{t_k}\}_{k \le m-1}$. Let $S_{t_k}$, the state vector, denote the 1/0
channel states of the users at $t_k$. We write the net expected reward as follows

\[
V_{t_m}(\pi_{t_m}, \{a_{t_m}, \{\hat{\mathfrak{A}}_{t_k}\}_{k \le m-1}\})
= \pi_{t_m}(a_{t_m}) + \sum_{S_{t_m}} P_{S_{t_m}|\pi_{t_m}}(S_{t_m} | \pi_{t_m})\, \hat{V}_{t_{m-1}}(S_{t_m}, O_{t_{m-1}})
\]

where $\hat{V}_{t_{m-1}}$ is the expected future reward conditioned on the state vector in the previous
control interval on the sporadic time axis, i.e., $t_m$. The hat on this quantity emphasizes
the use of the greedy policy in all $k \le m-1$. $P_{S_{t_m}|\pi_{t_m}}(S_{t_m} | \pi_{t_m})$ is the conditional
probability of the current state vector $S_{t_m}$ given the belief vector $\pi_{t_m}$. Note that the schedule
order vector $O_{t_{m-1}}$ is only a function of $O_{t_m}$ and the state $S_{t_m}(a_{t_m})$, thus maintaining
consistency with the amount of information available for scheduling decision in the actual
problem setup. We now proceed to compare the net expected reward when $a_{t_m} = n$ and
$a_{t_m} = n + 1$ where $n \in \{1 \ldots N_f - 1\}$. Let $Y$ and $X$ be random binary vectors of lengths
$n - 1$ and $N_f - n - 1$ (empty when the length is non-positive) respectively. Then,

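\[
\begin{aligned}
V_{t_m}(\pi_{t_m}, \{a_{t_m} = n, \{\hat{\mathfrak{A}}_{t_k}\}_{k \le m-1}\})
= \pi_{t_m}(n) &+ \sum_{Y,X} P_{S_{t_m}|\pi_{t_m}}([Y\,0\,0\,X] | \pi_{t_m}) \times \hat{V}_{t_{m-1}}([Y\,0\,0\,X], [\{O_{t_m} - n\}\ n]) \\
&+ \sum_{Y,X} P_{S_{t_m}|\pi_{t_m}}([Y\,0\,1\,X] | \pi_{t_m}) \times \hat{V}_{t_{m-1}}([Y\,0\,1\,X], [\{O_{t_m} - n\}\ n]) \\
&+ \sum_{Y,X} P_{S_{t_m}|\pi_{t_m}}([Y\,1\,0\,X] | \pi_{t_m}) \times \hat{V}_{t_{m-1}}([Y\,1\,0\,X], [n\ \{O_{t_m} - n\}]) \\
&+ \sum_{Y,X} P_{S_{t_m}|\pi_{t_m}}([Y\,1\,1\,X] | \pi_{t_m}) \times \hat{V}_{t_{m-1}}([Y\,1\,1\,X], [n\ \{O_{t_m} - n\}]).
\end{aligned} \tag{11}
\]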

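Since the Markov channel statistics are identical across the users, we have the following
symmetry property: for any $k \ge 1$,

\[
\hat{V}_{t_k}(S_{t_{k+1}}, O_{t_k}) = \hat{V}_{t_k}(S'_{t_{k+1}}, O'_{t_k})
\quad \text{if } S_{t_{k+1}}(O_{t_k}(i)) = S'_{t_{k+1}}(O'_{t_k}(i))\ \ \forall i \in \{1 \ldots N_f\}. \tag{12}
\]

Expanding $V_{t_m}(\pi_{t_m}, \{a_{t_m} = n+1, \{\hat{\mathfrak{A}}_{t_k}\}_{k \le m-1}\})$ along the lines of (11), and using the
symmetry property, with further mathematical simplification, we can evaluate the difference
in the net expected reward as follows,

\[
\begin{aligned}
& V_{t_m}(\pi_{t_m}, \{a_{t_m} = n, \{\hat{\mathfrak{A}}_{t_k}\}_{k \le m-1}\}) - V_{t_m}(\pi_{t_m}, \{a_{t_m} = n+1, \{\hat{\mathfrak{A}}_{t_k}\}_{k \le m-1}\}) \\
&= (\pi_{t_m}(n) - \pi_{t_m}(n+1)) \Big( 1 - \sum_{Y,X} \big[ \hat{V}_{t_{m-1}}([Y\,1\,X\,0], [1 \ldots N_f]) - \hat{V}_{t_{m-1}}([1\,Y\,0\,X], [1 \ldots N_f]) \big] \\
&\qquad \times P_{S_{t_m}|\pi_{t_m}}([S_{t_m}(1) \ldots S_{t_m}(n-1)] = Y | \pi_{t_m}) \times P_{S_{t_m}|\pi_{t_m}}([S_{t_m}(n+2) \ldots S_{t_m}(N_f)] = X | \pi_{t_m}) \Big).
\end{aligned} \tag{13}
\]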

Lemma 2. The greedy policy maximizes the throughput on the sporadic time axis $\mathbf{t}_n$ if the
following (sufficient) condition holds.

\[
\hat{V}_{t_{m-1}}([Y\,1\,X\,0], [1 \ldots N_f]) - \hat{V}_{t_{m-1}}([1\,Y\,0\,X], [1 \ldots N_f]) \le 1, \tag{14}
\]

$\forall\, n \ge m > 1$, $n \in \{1 \ldots N_f - 1\}$, with $Y$, $X$ being random binary vectors of length $n - 1$
and $N_f - n - 1$, and $\hat{V}_{t_{m-1}}$ being the reward accrued under the greedy policy, i.e., when
$\{\mathfrak{A}_{t_k}\}_{k \le m-1} = \{\hat{\mathfrak{A}}_{t_k}\}_{k \le m-1}$.

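Proof. Let condition (14) be true. Let $m > 1$ be fixed. Since, by assumption, $\pi_{t_m}(n) \ge
\pi_{t_m}(n+1)$ $\forall n \in \{1, \ldots, N_f - 1\}$, we have from (13),

\[
V_{t_m}(\pi_{t_m}, \{a_{t_m} = n, \{\hat{a}_{t_k}\}_{k \le m-1}\}) \ge V_{t_m}(\pi_{t_m}, \{a_{t_m} = n+1, \{\hat{a}_{t_k}\}_{k \le m-1}\}).
\]

Therefore,

\[
V_{t_m}(\pi_{t_m}, \{a_{t_m} = \arg\max_i \pi_{t_m}(i) = 1, \{\hat{a}_{t_k}\}_{k \le m-1}\})
\ge V_{t_m}(\pi_{t_m}, \{a_{t_m} \in \{2, \ldots, N_f\}, \{\hat{a}_{t_k}\}_{k \le m-1}\}).
\]

We now have the following statement: if, $\forall \pi_{t_{m-1}} \in [0,1]^{N_f}$,

\[
\{\hat{a}_{t_k}\}_{k \le m-1} = \arg\max_{\{a_{t_k}\}_{k \le m-1}} V_{t_{m-1}}(\pi_{t_{m-1}}, \{a_{t_k}\}_{k \le m-1}),
\]

then, $\forall \pi_{t_m} \in [0,1]^{N_f}$,

\[
\{\hat{a}_{t_k}\}_{k \le m} = \arg\max_{\{a_{t_k}\}_{k \le m}} V_{t_m}(\pi_{t_m}, \{a_{t_k}\}_{k \le m}). \tag{15}
\]

Since $\hat{a}_{t_1} = \arg\max_{a_{t_1}} V_{t_1}(\pi_{t_1}, a_{t_1})$, $\forall \pi_{t_1} \in [0,1]^{N_f}$, using (15), by induction, we have

\[
\{\hat{a}_{t_k}\}_{k \le m} = \arg\max_{\{a_{t_k}\}_{k \le m}} V_{t_m}(\pi_{t_m}, \{a_{t_k}\}_{k \le m})
\]

$\forall\, n \ge m \ge 1$, $\pi_{t_m} \in [0,1]^{N_f}$.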

The lemma thus follows. □ 
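
As a concrete illustration of the greedy policy that Lemma 2 refers to, the following Python sketch runs one control interval of a belief-based greedy scheduler for a single group of users. It assumes the belief update used later in the paper: a scheduled user's belief becomes $p$ after an ACK and $r$ after a NACK, while an unscheduled user's belief evolves as $T(\pi) = \pi p + (1-\pi) r$. The function names `T`, `greedy_choice` and `update_beliefs` are hypothetical and are not taken from the original.

```python
def T(pi, p, r):
    """One-step belief evolution of an unobserved user: T(pi) = pi*p + (1-pi)*r."""
    return pi * p + (1.0 - pi) * r


def greedy_choice(beliefs):
    """Index of the user with the highest belief (ties go to the lowest index)."""
    return max(range(len(beliefs)), key=lambda i: beliefs[i])


def update_beliefs(beliefs, scheduled, ack, p, r):
    """Beliefs for the next control interval after ARQ feedback from `scheduled`."""
    return [
        (p if ack else r) if i == scheduled else T(b, p, r)
        for i, b in enumerate(beliefs)
    ]


if __name__ == "__main__":
    p, r = 0.4809, 0.3294            # channel statistics quoted for Fig. 5
    beliefs = [0.40, 0.45, 0.35]     # illustrative starting beliefs (assumed)
    user = greedy_choice(beliefs)
    print(user, update_beliefs(beliefs, user, ack=False, p=p, r=r))
```

With $p > r$, a NACK sends the scheduled user to the lowest belief and an ACK keeps it at the highest, which is the round-robin behaviour that the proof of Proposition 3 relies on.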

We now formally introduce the optimal multiuser scheduling policy in the two-cell 
system with asymmetric cooperation. 

Proposition 3. The policy {9, 9,9} maximizes the sum throughput of the two-cell 
system. 

Proof. We begin by establishing that the sufficient condition in fact holds, using the round
robin structure of the greedy policy on the sporadic time axis. Consider a realization of
the channel states of the $N_f$ users on the time axis $t_{m-1}$, $m \le n$. Denote it by $\{R, i, j\}$,
where $i, j$ indicate the channel states of users $n + 1$ and $N_f$, respectively, at time $t_{m-1}$,
with $R$ indicating the rest of the channel state realization. We can rewrite the second
quantity of the sufficient condition as follows.

\[
V_{t_{m-1}}([1\ Y\ 0\ X], [1 \ldots N_f]) = V_{t_{m-1}}([Y\ 0\ X\ 1], [N_f, 1 \ldots N_f - 1])
\]

Define $V_a(\{R, i, j\})$ as the reward accrued from time $t_{m-1}$ on the sporadic axis when the
channel states have a realization $\{R, i, j\}$ and the greedy policy is implemented in the
order $[1 \ldots N_f]$ from control interval $t_{m-1}$. Let $V_b(\{R, i, j\})$ be similarly defined with the
order given by $[N_f, 1 \ldots N_f - 1]$. Let $P(\{i,j\}|\{k,l\}) = P(S_{t_{m-1}}(n+1) = i, S_{t_{m-1}}(N_f) =
j \mid S_{t_m}(n+1) = k, S_{t_m}(N_f) = l)$. The sufficient condition can now be rewritten as below.

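\[
\begin{aligned}
& V_{t_{m-1}}([Y\ 1\ X\ 0], [1 \ldots N_f]) - V_{t_{m-1}}([1\ Y\ 0\ X], [1 \ldots N_f]) \\
&= \sum_R P\big(R \mid S_{t_m}(1) \ldots S_{t_m}(n-1) = Y,\ S_{t_m}(n+2) \ldots S_{t_m}(N_f) = X\big) \times \Big[ \\
&\qquad\ \ P(\{1,0\}|\{1,0\})\, V_a(\{R,1,0\}) - P(\{0,1\}|\{0,1\})\, V_b(\{R,0,1\}) \\
&\qquad + P(\{0,1\}|\{1,0\})\, V_a(\{R,0,1\}) - P(\{1,0\}|\{0,1\})\, V_b(\{R,1,0\}) \\
&\qquad + P(\{0,0\}|\{1,0\})\, V_a(\{R,0,0\}) - P(\{0,0\}|\{0,1\})\, V_b(\{R,0,0\}) \\
&\qquad + P(\{1,1\}|\{1,0\})\, V_a(\{R,1,1\}) - P(\{1,1\}|\{0,1\})\, V_b(\{R,1,1\}) \Big] \\
&= \sum_R P\big(R \mid S_{t_m}(1) \ldots S_{t_m}(n-1) = Y,\ S_{t_m}(n+2) \ldots S_{t_m}(N_f) = X\big) \times \Big[ \\
&\qquad\ \ p(1-r)\big(V_a(\{R,1,0\}) - V_b(\{R,0,1\})\big) + (1-p)r\big(V_a(\{R,0,1\}) - V_b(\{R,1,0\})\big) \\
&\qquad + (1-p)(1-r)\big(V_a(\{R,0,0\}) - V_b(\{R,0,0\})\big) + pr\big(V_a(\{R,1,1\}) - V_b(\{R,1,1\})\big) \Big]
\end{aligned}
\tag{16}
\]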

It has been shown in [10] that when the greedy policy is implemented in the orders $[1 \ldots N_f]$
and $[N_f, 1 \ldots N_f - 1]$, the difference in the reward accrued, for any fixed realization, is upper
bounded by 1. The sample path argument used in the proof works for the non-sporadic
axis as well. Thus $V_a(\{R,i,j\}) - V_b(\{R,i,j\}) \le 1$ for any $\{R,i,j\}$. Notice that since
the realization is fixed and since $V_a(\{R,i,j\})$ schedules user 1 first, the value of $j$ does
not affect $V_a$. Thus $V_a(\{R,i,1\}) = V_a(\{R,i,0\})$. Similarly, $V_b(\{R,1,j\}) = V_b(\{R,0,j\})$.
Using these observations in (16), we see that the sufficient condition holds. The proposition
is thus established from Lemma 1 and Lemma 2. □
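
The final step of this proof can be spelled out as a short worked calculation. This is a sketch under the assumption that the two observations read $V_a(\{R,i,1\}) = V_a(\{R,i,0\})$ and $V_b(\{R,1,j\}) = V_b(\{R,0,j\})$, and that the bound from [10] applies to every fixed realization:

```latex
\begin{align*}
V_a(\{R,1,0\}) - V_b(\{R,0,1\}) &= V_a(\{R,1,1\}) - V_b(\{R,1,1\}) \le 1, \\
V_a(\{R,0,1\}) - V_b(\{R,1,0\}) &= V_a(\{R,0,0\}) - V_b(\{R,0,0\}) \le 1, \\
p(1-r) + (1-p)r + (1-p)(1-r) + pr &= 1 .
\end{align*}
```

Each of the four differences in the last line of (16) is therefore at most 1, and their weights are non-negative and sum to 1. The bracketed term is a convex combination bounded by 1, and averaging over $R$ preserves the bound, which gives condition (14).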

3.2 Under Symmetric Cooperation With Constraints on the 
Breathing Pattern 

Assume the cells cooperate mutually in maintaining the cell breathing protocol. Assume 
there is a constraint on the breathing pattern, i.e., cell 1 must breathe-out (schedule 
far users) and cell 2 must breathe-in in a predetermined exhaustive sequence of control 
intervals t. We have the following observation. 

Proposition 4. Under a fixed t, the optimal scheduling policies are decoupled across the
cells. The optimal policy within each group in each cell is the greedy policy.

Proof. We see this readily from the following two observations: 

• Decoupling: with t fixed, within t, the schedule decision of a cell does not affect
the schedule decisions of the neighboring cells, since there is no longer any burden
of maintaining the cell breathing protocol. The same argument holds for the
complementary time axis.

• We have earlier shown that, for a fixed realization of the sporadic time axis, if a
cell has to schedule a user from only one group, then the greedy policy is optimal
within that axis.

□

We have a straightforward corollary to this result.

Corollary 5. When there are multiple orthogonal carriers available to the cells, with 
users assigned to the carriers such that no single carrier serves a near-near or far-far 
pair across the cells, then it is optimal to greedily schedule users within every carrier 
within each cell. 

Note that the decoupling argument of Proposition 4 is used here. Availability of multiple
carriers is usually the case in narrowband systems.

4 Restless Multiarmed Bandit Processes 

A direct analysis of the ARQ based scheduling problem with symmetric cooperation and 
no constraints on the breathing pattern appears very difficult due to the complex rela- 
tionship between the schedule decisions across space and time. We therefore establish a 
connection between the scheduling problem and the restless multiarmed bandit processes 
(RMAB) and make use of the established theory behind RMAB in our analysis. We 
proceed with a survey on the RMAB theory. 

Multi-armed bandit (MAB) problems [23] are defined as a family of sequential dynamic
resource allocation problems in the presence of several competing, independently evolving
projects. They are characterized by a fundamental tradeoff between decisions guaranteeing
high immediate rewards versus those that sacrifice immediate rewards for better
future rewards. Several technological and scientific disciplines such as sensor management,
manufacturing systems, economics, queueing and communication networks ([23])
encounter resource allocation problems that can be modeled as MAB processes. In the
classic MAB processes, in each control interval, a single project has to be allocated the
available system resources. The state of the project thus scheduled evolves from the
current time slot to the next time slot, whereas those not scheduled remain frozen.
Gittins and Jones ([24]) studied these processes and showed that the optimal solution
is of the index type: i.e., for each bandit process (project or an arm of the MAB), an
index that is a function of the state of the project is computed and the project with the
highest index is scheduled. The authors called this index the Dynamic Allocation
Index, but it is now known as the Gittins index. Note that the optimal scheduling
policy, which originally required the solution of an $N$-armed bandit process ($N$ being the
number of projects), is now reduced to determining the Gittins index for $N$ single-armed
bandit processes, thus exponentially reducing the problem complexity. This complexity
reduction is one of the main reasons behind the immense interest in index policies for
MAB processes and their variants. We will discuss this next.

Whittle [22] generalized the MAB processes as follows: In each control interval,
exactly $M \ge 1$ projects are scheduled. The states of the remaining $N - M$ projects are not frozen
as in MAB, but evolve in time. They also contribute rewards ($W$) known as passivity
rewards. These processes are called the restless multiarmed bandit (RMAB) processes,
the term restless being indicative of the state evolution of even the projects that are not
scheduled at the moment. Considering a single project, Whittle defines the $W$-subsidy
policy as follows: In each control interval, schedule the project if the sum of the immediate
reward and the future reward corresponding to an active decision outweighs the sum
of the reward for passivity ($W$) and the corresponding future reward. For a state $\pi$ of
the project, the value of $W$ corresponding to equal net rewards for active and passive
decisions is defined as the index $I(\pi)$. The notion of indexability is now defined by Whittle
as below:

Let $D(W)$ be the set of states for which a project would be made passive under a
$W$-subsidy policy. The project is indexable if $D(W)$ increases monotonically from $\emptyset$ to $S$
as $W$ increases from $-\infty$ to $+\infty$,

where $\emptyset$ is the empty set and $S$ is the universal set of the states of the project. The
notion of indexability gives a consistent ordering of states with respect to the indices,
i.e., if $I(\pi_1) > I(\pi_2)$ and if it is optimal to activate the project when in state $\pi_2$, then it is
optimal to activate the project when it is in state $\pi_1$. Returning to the RMAB scheduling
problem, Whittle proposes the index scheduling policy: In each control interval, activate
the $M$ projects that have the greatest indices. Note that the natural ordering of states
based on indices (under indexability) gives credibility to the index policy. Whittle shows
that when the strict constraint on the number of projects per interval ($M$) is relaxed to
an average constraint, the index scheduling policy is optimal. He also shows that when
the restlessness aspect is removed from the RMAB and when $W = 0$, the index reduces
to the Gittins index. He conjectured that, with $M/N$ fixed, as $M, N \to \infty$, the index policy
is optimal. This was later proved to be true in [26] except for very special cases of RMAB
processes.
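
The index scheduling policy itself is simple to state in code once the indices are available. The Python sketch below selects the $M$ projects with the greatest indices. The dictionary of index values is hypothetical illustration data, and the function name `index_policy` is not from the original.

```python
import heapq


def index_policy(indices, M):
    """Return the IDs of the M projects with the largest index values.

    `indices` maps a project ID to its current index I(state). How the index
    is computed is left open here; only the selection rule is shown.
    """
    return heapq.nlargest(M, indices, key=indices.get)


if __name__ == "__main__":
    example = {"a": 0.9, "b": 1.4, "c": 1.1, "d": 0.7}   # made-up index values
    print(index_policy(example, M=2))
```

The per-interval work is a selection over $N$ scalar indices, each of which depends only on the state of its own project.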

Indexability is a very strict requirement ([25]) that is hard to check. There have
been several works [26] [27] [28] [29] [30] [31] on indexability and index policies for various
RMAB processes. In [26], for a special class of RMAB, the authors show that, if the
RMAB is indexable, under certain technical conditions, the index policy is optimal.
In [27], the authors provide a sufficient condition for the indexability of a single restless
bandit. The authors in [29] investigate indexability under a set of conditions called Partial
Conservation Laws (PCL). They identify a class of RMAB processes that satisfy the PCL
and are indexable in the sense of Whittle. They also show that, under PCL, if the rewards
belong to a certain "admissible region" then a priority index based allocation policy is
optimal. In [31], the authors consider an RMAB process with improving/deteriorating jobs
and establish indexability for the processes. They demonstrate, via numerical analysis,
the strong performance of the index policy and obtain performance guarantees for the
index policy. Thus we see that the notion of indexability offers a promising starting point
towards the (otherwise intractable) analysis of optimal RMAB scheduling.

Returning to the scheduling problem addressed in this work, assume that the near
users in cell 1 are permanently paired (one to one) with far users in cell 2 and vice versa
(requires $N = F$). Thus if a user is scheduled in a cell, under cell breathing its pair must
be scheduled in the adjacent cell. We can now visualize each pair as a restless bandit with
one and only one pair scheduled in any control interval. Thus the ARQ based scheduling
problem is now an RMAB scheduling problem. When the permanent pairing condition is
removed, we have a set of $2NF$ projects, considering all possible legitimate pairings across
cells. These projects do not evolve independently and hence do not constitute an RMAB
process$^{5}$. Thus we have a more complex variant of the RMAB processes. We call this
the RMAB-v processes. To the best of our knowledge, there is no analysis of scheduling
in RMAB-v processes.
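
The count of $2NF$ projects, and the reason they cannot evolve independently, can be checked with a small enumeration. The Python sketch below assumes each cell has $N$ near users and $F$ far users, and that a legitimate pair joins a near user in one cell with a far user in the other. The tuple labels and function names are hypothetical.

```python
from itertools import product


def legitimate_pairs(N, F):
    """All near-far pairings across two cells under cell breathing."""
    near1 = [("cell1", "near", i) for i in range(N)]
    far1 = [("cell1", "far", i) for i in range(F)]
    near2 = [("cell2", "near", i) for i in range(N)]
    far2 = [("cell2", "far", i) for i in range(F)]
    # Cell 1 breathes in (near) while cell 2 breathes out (far), and vice versa.
    return list(product(near1, far2)) + list(product(far1, near2))


def share_a_user(project_a, project_b):
    """True if two projects contain a common user, so their states are coupled."""
    return bool(set(project_a) & set(project_b))


if __name__ == "__main__":
    pairs = legitimate_pairs(N=3, F=2)
    print(len(pairs), share_a_user(pairs[0], pairs[1]))
```

Two projects that share a user see the same channel and the same ARQ feedback for that user, which is why the independence required of RMAB projects fails here.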

Note, from previous discussion, that the index policies are very attractive from an 
implementation point of view. From an optimality point of view, the attractiveness 
of the index policies can be attributed to the natural ordering of states (and hence 
projects) based on indices, as guaranteed by indexability. We are curious to see whether 
this advantage carries over to the RMAB-v processes at hand. As a first step in this 
direction, we perform an indexability analysis of the RMAB-v and obtain partial results 
on the structure of the index policy in the following sections. 

5 Indexability of the RMAB-v 

Following the approach of Whittle in [23], we consider the indexability of a single
project made by a near-far user pair. We begin with the following definition: A function
$f(x_1, x_2, \ldots, x_n)$ is component-wise piecewise linear and component-wise convex in
$(x_1, x_2, \ldots, x_n)$ if, by fixing arbitrary values along any $n - 1$ dimensions, $f(\cdot)$ is piecewise
linear and convex in the remaining dimension.

Lemma 6. The reward function, $V_t(W, \pi_1, \pi_2)$, is component-wise piecewise linear and
component-wise convex in $(W, \pi_1, \pi_2)$, for any $t \ge 1$.

Proof. Let $\mathcal{F}$ denote the family of functions defined over $(W, \pi_1, \pi_2) \in \mathbb{R}_+ \times [0,1]^2$ that
are component-wise piecewise linear and component-wise convex in $(W, \pi_1, \pi_2)$. Let $L(x)$
be an arbitrarily defined affine function in $x \in [0,1]$. The reward function at control
interval $t$, when the state of the system is $(L(\pi_1), L(\pi_2))$, is given by

\[
\begin{aligned}
V_t(W, L(\pi_1), L(\pi_2)) = \max\Big( & W + V_{t-1}(W, T(L(\pi_1)), T(L(\pi_2))),\ L(\pi_1) + L(\pi_2) \\
& + L(\pi_1) L(\pi_2) V_{t-1}(W, p, p) + L(\pi_1)(1 - L(\pi_2)) V_{t-1}(W, p, r) \\
& + (1 - L(\pi_1)) L(\pi_2) V_{t-1}(W, r, p) \\
& + (1 - L(\pi_1))(1 - L(\pi_2)) V_{t-1}(W, r, r) \Big)
\end{aligned}
\tag{17}
\]

Let (A) denote the following condition: $V_{t-1}(W, L(\pi_1), L(\pi_2)) \in \mathcal{F}$ for any arbitrarily
defined affine function $L$.

Note that $T(L(x))$ is affine in $x$. Hence, under (A), both the arguments to the max
operator are component-wise piecewise linear and component-wise convex in $(W, \pi_1, \pi_2)$.
Since the max operator, across each dimension, effectively traces the top envelope of a set
of piecewise linear, convex functions, the piecewise linearity and convexity is preserved
across each dimension. Thus $V_t(W, L(\pi_1), L(\pi_2)) \in \mathcal{F}$ under condition (A).

We now proceed to establish the induction basis for condition (A). Since, by definition,
$V_1(W, L(\pi_1), L(\pi_2)) = \max(W, L(\pi_1) + L(\pi_2))$, we readily see that $V_1(W, L(\pi_1), L(\pi_2)) \in
\mathcal{F}$. Thus, by induction, from the preceding statements, $V_t(W, L(\pi_1), L(\pi_2)) \in \mathcal{F}$, $\forall t \ge 1$.

With $L(x) = x$, the lemma is established. □

5 The projects must evolve independently in RMAB processes, by definition. 






We now establish a relation between the expected future rewards accrued after active
(schedule) and passive (idle) decisions. Let $V_t^A(W, \pi_1, \pi_2)$ and $V_t^P(W, \pi_1, \pi_2)$, for $t \ge 2$,
denote the expected future rewards (accrued from control interval $t - 1$) corresponding
to active and passive decisions, respectively, at control interval $t$, with $(\pi_1, \pi_2)$ being the
state of the system at $t$. By definition, we have

\[
\begin{aligned}
V_t^A(W, \pi_1, \pi_2) &= \pi_1 \pi_2 V_{t-1}(W, p, p) + \pi_1 (1 - \pi_2) V_{t-1}(W, p, r) \\
&\quad + (1 - \pi_1) \pi_2 V_{t-1}(W, r, p) + (1 - \pi_1)(1 - \pi_2) V_{t-1}(W, r, r) \\
V_t^P(W, \pi_1, \pi_2) &= V_{t-1}(W, T(\pi_1), T(\pi_2)).
\end{aligned}
\tag{18}
\]
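
The recursion behind (17) and (18) can be evaluated directly for small horizons. The Python sketch below is a minimal illustration under stated assumptions: $T(\pi) = \pi p + (1 - \pi) r$, the terminal condition $V_0 = 0$ (so that $V_1 = \max(W, \pi_1 + \pi_2)$ as in the proof of Lemma 6), and memoization on floating-point arguments. The function name `make_value_functions` is hypothetical.

```python
from functools import lru_cache


def make_value_functions(p, r):
    """Build V_t, V_t^A and V_t^P for one near-far project (assumes V_0 = 0)."""

    def T(pi):
        # Belief evolution of an unobserved user over one control interval.
        return pi * p + (1.0 - pi) * r

    @lru_cache(maxsize=None)
    def V(t, W, pi1, pi2):
        if t == 0:
            return 0.0
        passive = W + V_P(t, W, pi1, pi2)            # take the subsidy W
        active = pi1 + pi2 + V_A(t, W, pi1, pi2)     # expected immediate reward
        return max(passive, active)

    def V_A(t, W, pi1, pi2):
        # Expectation over the four ACK/NACK outcomes of the scheduled pair.
        return (pi1 * pi2 * V(t - 1, W, p, p)
                + pi1 * (1 - pi2) * V(t - 1, W, p, r)
                + (1 - pi1) * pi2 * V(t - 1, W, r, p)
                + (1 - pi1) * (1 - pi2) * V(t - 1, W, r, r))

    def V_P(t, W, pi1, pi2):
        # No feedback is received; both beliefs drift through T.
        return V(t - 1, W, T(pi1), T(pi2))

    return V, V_A, V_P


if __name__ == "__main__":
    V, V_A, V_P = make_value_functions(p=0.4809, r=0.3294)
    print(V(5, 0.8, 0.4, 0.45), V_A(5, 0.8, 0.4, 0.45), V_P(5, 0.8, 0.4, 0.45))
```

The time index counts control intervals remaining, matching the convention in which the interval after $t$ is $t - 1$.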

Lemma 7. The expected future reward accrued after an active decision is at least as
high as the expected future reward accrued after a passive decision, i.e., $V_t^A(W, \pi_1, \pi_2) \ge
V_t^P(W, \pi_1, \pi_2)$, $\forall t \ge 2$, $W \ge 0$, $(\pi_1, \pi_2) \in [0,1]^2$.

Proof. We begin by rewriting $V_t^A(W, \pi_1, \pi_2)$:

\[
\begin{aligned}
V_t^A(W, \pi_1, \pi_2) &= \pi_2 \big( \pi_1 V_{t-1}(W, p, p) + (1 - \pi_1) V_{t-1}(W, r, p) \big) \\
&\quad + (1 - \pi_2) \big( \pi_1 V_{t-1}(W, p, r) + (1 - \pi_1) V_{t-1}(W, r, r) \big) \\
&\ge \pi_2 V_{t-1}(W, \pi_1 p + (1 - \pi_1) r, p) + (1 - \pi_2) V_{t-1}(W, \pi_1 p + (1 - \pi_1) r, r) \\
&\ge V_{t-1}(W, \pi_1 p + (1 - \pi_1) r, \pi_2 p + (1 - \pi_2) r) \\
&= V_{t-1}(W, T(\pi_1), T(\pi_2)) \\
&= V_t^P(W, \pi_1, \pi_2)
\end{aligned}
\tag{19}
\]

where the first and second inequalities use the component-wise convexity of $V_{t-1}$ along
the $\pi_1$ and $\pi_2$ dimensions, respectively (Lemma 6). □

Lemma 8. When $W \le 2r$ or $W \ge 2p$, the expected future rewards accrued after active
and passive decisions are equal, i.e., $V_t^A(W, \pi_1, \pi_2) = V_t^P(W, \pi_1, \pi_2)$, $\forall t \ge 2$, $(\pi_1, \pi_2) \in
[0,1]^2$, $W \notin (2r, 2p)$.

Proof. We consider the following two cases.

When $W \le 2r$:
Recall from (18),

\[
\begin{aligned}
V_t^A(W, \pi_1, \pi_2) &= \pi_1 \pi_2 V_{t-1}(W, p, p) + \pi_1 (1 - \pi_2) V_{t-1}(W, p, r) \\
&\quad + (1 - \pi_1) \pi_2 V_{t-1}(W, r, p) + (1 - \pi_1)(1 - \pi_2) V_{t-1}(W, r, r)
\end{aligned}
\tag{20}
\]

Note that $V_{t-1}(W, r, r) = \max(W + V_{t-1}^P(W, r, r),\ 2r + V_{t-1}^A(W, r, r))$. Since, from Lemma
7, $V_{t-1}^A(W, \pi_1, \pi_2) \ge V_{t-1}^P(W, \pi_1, \pi_2)$, $W \ge 0$, $(\pi_1, \pi_2) \in [0,1]^2$, we have $V_{t-1}(W, r, r) =
2r + V_{t-1}^A(W, r, r)$ when $W \le 2r$. Using the same argument for the other elements in (20),
we have

\[
\begin{aligned}
V_t^A(W, \pi_1, \pi_2) &= \pi_1 \pi_2 \big(2p + V_{t-1}^A(W, p, p)\big) + \pi_1 (1 - \pi_2) \big(p + r + V_{t-1}^A(W, p, r)\big) \\
&\quad + (1 - \pi_1) \pi_2 \big(r + p + V_{t-1}^A(W, r, p)\big) \\
&\quad + (1 - \pi_1)(1 - \pi_2) \big(2r + V_{t-1}^A(W, r, r)\big)
\end{aligned}
\tag{21}
\]

Recall from (18),

\[
\begin{aligned}
V_t^P(W, \pi_1, \pi_2) &= V_{t-1}(W, T(\pi_1), T(\pi_2)) \\
&= \max\big( W + V_{t-1}^P(W, T(\pi_1), T(\pi_2)), \\
&\qquad\quad T(\pi_1) + T(\pi_2) + V_{t-1}^A(W, T(\pi_1), T(\pi_2)) \big)
\end{aligned}
\tag{22}
\]

Note that $2r \le T(\pi_1) + T(\pi_2) = \pi_1 p + (1 - \pi_1) r + \pi_2 p + (1 - \pi_2) r \le 2p$. Thus, with
$V_{t-1}^A(W, T(\pi_1), T(\pi_2)) \ge V_{t-1}^P(W, T(\pi_1), T(\pi_2))$, when $W \le 2r$,

\[
\begin{aligned}
V_t^P(W, \pi_1, \pi_2) &= T(\pi_1) + T(\pi_2) + V_{t-1}^A(W, T(\pi_1), T(\pi_2)) \\
&= \pi_1 (p - r) + r + \pi_2 (p - r) + r \\
&\quad + (\pi_1 (p - r) + r)(\pi_2 (p - r) + r) V_{t-2}(W, p, p) \\
&\quad + (\pi_1 (p - r) + r)(1 - (\pi_2 (p - r) + r)) V_{t-2}(W, p, r) \\
&\quad + (1 - (\pi_1 (p - r) + r))(\pi_2 (p - r) + r) V_{t-2}(W, r, p) \\
&\quad + (1 - (\pi_1 (p - r) + r))(1 - (\pi_2 (p - r) + r)) V_{t-2}(W, r, r)
\end{aligned}
\tag{23}
\]

Expanding $V_{t-1}^A(W, \{r, p\}, \{r, p\})$ in (21) using (18), along with the fact$^{6}$ that $V_t(W, p, r) =
V_t(W, r, p)$, we equate (21) and (23), thus establishing the first part of the lemma.

When $W \ge 2p$:

Let $\pi_1, \pi_2$ be the belief values in any control interval $t > 1$. Then, in any control interval
$m < t$, $\pi_1|_{m<t} \in \{r, p, T^j(\pi_1), T^k(r), T^l(p)\}$ and $\pi_2|_{m<t} \in \{r, p, T^j(\pi_2), T^k(r), T^l(p)\}$ with
$j \le t - 1$ and $k, l \le t - 2$. Note that the belief values $\pi_1, \pi_2|_{m<t} \in \{r, p\}$ if the schedule
decision in the previous control interval was active, while the belief values take the other
forms when there is a continuous stretch of passive decisions immediately preceding the
control interval in question. By the definition of $T^k$, we can see that $r \le T^j(\pi) \le p$,
$\forall \pi \in [0,1]$ and $j \ge 1$. Thus we see that, $\forall m < t$, $2r \le \pi_1 + \pi_2|_{m<t} \le 2p$. With this
observation consider, from (18),

\[
\begin{aligned}
V_t^A(W, \pi_1, \pi_2) &= \pi_1 \pi_2 V_{t-1}(W, p, p) + \pi_1 (1 - \pi_2) V_{t-1}(W, p, r) \\
&\quad + (1 - \pi_1) \pi_2 V_{t-1}(W, r, p) + (1 - \pi_1)(1 - \pi_2) V_{t-1}(W, r, r)
\end{aligned}
\tag{24}
\]

Consider $V_{t-1}(W, p, p)$ from the preceding equation. From the preceding discussion, $\forall m <
t - 1$, $2r \le \pi_1 + \pi_2|_{m<t-1} \le 2p$. Also note that $\pi_1 + \pi_2|_{t-1} = 2p$. Thus for any sequence
of schedule choices and ARQ feedback, the immediate reward in any control interval
$m \le t - 1$ is upper bounded by $2p$. Thus when $W \ge 2p$, it is optimal to not schedule in all
control intervals from $t - 1$. Thus $V_{t-1}(W \ge 2p, p, p) = (t - 1) W$. Using a similar argument
we have $V_{t-1}(W \ge 2p, p, r) = V_{t-1}(W \ge 2p, r, p) = V_{t-1}(W \ge 2p, r, r) = (t - 1) W$.
Therefore, from (24),

\[
V_t^A(W, \pi_1, \pi_2) = (t - 1) W, \quad \forall (\pi_1, \pi_2) \in [0,1]^2,\ W \ge 2p \tag{25}
\]

6 Thanks to the symmetry of the channel and reward structure across the users.

From (18), the future reward after a passive decision is given by

\[
V_t^P(W, \pi_1, \pi_2) = V_{t-1}(W, T(\pi_1), T(\pi_2)) \tag{26}
\]

Since $2r \le T(\pi_1) + T(\pi_2) \le 2p$, $\forall (\pi_1, \pi_2) \in [0,1]^2$, using an argument similar to above,
we have

\[
V_t^P(W, \pi_1, \pi_2) = (t - 1) W, \quad \forall (\pi_1, \pi_2) \in [0,1]^2,\ W \ge 2p \tag{27}
\]

This completes the proof. □ 
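
The shortest horizon gives a quick sanity check of Lemma 8. The following worked case for $t = 2$ is a sketch that uses only (18), the identity $V_1(W, \pi_1, \pi_2) = \max(W, \pi_1 + \pi_2)$ from the proof of Lemma 6, and $T(\pi) = \pi p + (1 - \pi) r$:

```latex
\begin{align*}
\text{For } W \le 2r:\quad
V_2^A(W,\pi_1,\pi_2) &= \pi_1\pi_2\,(2p) + \pi_1(1-\pi_2)(p+r) + (1-\pi_1)\pi_2\,(r+p) + (1-\pi_1)(1-\pi_2)(2r) \\
&= T(\pi_1) + T(\pi_2), \\
V_2^P(W,\pi_1,\pi_2) &= \max\big(W,\ T(\pi_1) + T(\pi_2)\big) = T(\pi_1) + T(\pi_2). \\[4pt]
\text{For } W \ge 2p:\quad
V_2^A(W,\pi_1,\pi_2) &= \big(\pi_1\pi_2 + \pi_1(1-\pi_2) + (1-\pi_1)\pi_2 + (1-\pi_1)(1-\pi_2)\big)\, W = W, \\
V_2^P(W,\pi_1,\pi_2) &= \max\big(W,\ T(\pi_1) + T(\pi_2)\big) = W.
\end{align*}
```

Both cases use $2r \le T(\pi_1) + T(\pi_2) \le 2p$, and the second agrees with $(t - 1) W$ in (25) and (27) at $t = 2$.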

Lemma 9. If for any $t \ge 1$, $(\pi_1, \pi_2) \in [0,1]^2$, $W^*$ is such that $W + V_t^P(W, \pi_1, \pi_2) =
\pi_1 + \pi_2 + V_t^A(W, \pi_1, \pi_2)|_{W = W^*}$, then the project is indexable (in the sense of Whittle) at
time $t$ iff $W^*$ is unique. The index is given by $I(t, \pi_1, \pi_2) = W^*$.

Proof. We first consider the sufficiency part. For a fixed $t$, $(\pi_1, \pi_2) \in [0,1]^2$, at $W = W^*$,
$W - (\pi_1 + \pi_2) = V_t^A(W, \pi_1, \pi_2) - V_t^P(W, \pi_1, \pi_2)$. Note that, from Lemma 7, $V_t^A(W, \pi_1, \pi_2) -
V_t^P(W, \pi_1, \pi_2) \ge 0$, $\forall W \ge 0$. Also, $W - (\pi_1 + \pi_2)$ is an increasing function in $W$ with
negative values $\forall W < (\pi_1 + \pi_2)$. Using these, along with the uniqueness of $W^*$, we
readily see the following:

\[
\begin{aligned}
W + V_t^P(W, \pi_1, \pi_2) &< \pi_1 + \pi_2 + V_t^A(W, \pi_1, \pi_2), \quad \forall W < W^* \\
W + V_t^P(W, \pi_1, \pi_2) &> \pi_1 + \pi_2 + V_t^A(W, \pi_1, \pi_2), \quad \forall W > W^*
\end{aligned}
\tag{28}
\]

This is precisely the definition of indexability with index $I(t, \pi_1, \pi_2) = W^*$.

From the definition of indexability, we can readily see that uniqueness of $W^*$ is
necessary for indexability. □

Claim 10. The ARQ based, single project, active/passive scheduling scheme is indexable 
and hence the RMAB-v is indexable. 

Our claim is based on extensive simulations, partially reproduced in Fig. 4.
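
A numerical experiment of this kind can be sketched in a few lines. The Python code below is not the authors' simulation. It is a minimal illustration that evaluates $g(W) = (W - (\pi_1 + \pi_2)) - (V_t^A - V_t^P)$ using the recursion in (17) and (18), assuming $V_0 = 0$, and locates a zero of $g$ by bisection on $[\pi_1 + \pi_2, \max(\pi_1 + \pi_2, 2p)]$. The result is the index only if the crossing is unique, which is what Claim 10 conjectures. The names `reward_gap` and `whittle_index` are hypothetical.

```python
from functools import lru_cache


def reward_gap(t, pi1, pi2, p, r):
    """Return g(W) = (W - (pi1 + pi2)) - (V_t^A - V_t^P) for one project."""

    def T(x):
        return x * p + (1.0 - x) * r

    @lru_cache(maxsize=None)
    def V(s, W, a, b):
        if s == 0:                       # assumed terminal condition V_0 = 0
            return 0.0
        return max(W + V_P(s, W, a, b), a + b + V_A(s, W, a, b))

    def V_A(s, W, a, b):
        return (a * b * V(s - 1, W, p, p) + a * (1 - b) * V(s - 1, W, p, r)
                + (1 - a) * b * V(s - 1, W, r, p)
                + (1 - a) * (1 - b) * V(s - 1, W, r, r))

    def V_P(s, W, a, b):
        return V(s - 1, W, T(a), T(b))

    return lambda W: (W - (pi1 + pi2)) - (V_A(t, W, pi1, pi2) - V_P(t, W, pi1, pi2))


def whittle_index(t, pi1, pi2, p, r, tol=1e-9):
    """Bisection for a crossing W*; it is the index only if the crossing is unique."""
    g = reward_gap(t, pi1, pi2, p, r)
    # g(lo) <= 0 by Lemma 7 and g(hi) >= 0 by Lemma 8, so a zero lies in [lo, hi].
    lo, hi = pi1 + pi2, max(pi1 + pi2, 2.0 * p)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(whittle_index(5, 0.40, 0.45, p=0.4809, r=0.3294))
```

Plotting `reward_gap(...)` over a grid of $W$ values for several states would show whether $g$ changes sign more than once, which is the check the figure describes.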



6 Structure of the Index Policy 

Proposition 11. In any control interval $t$, under indexability, the index function satisfies
the following:

\[
\begin{aligned}
I(t, \pi_1, \pi_2) &= \pi_1 + \pi_2 && \text{if } \pi_1 + \pi_2 \notin (2r, 2p) \\
I(t, \pi_1, \pi_2) &\in [\pi_1 + \pi_2, 2p) && \text{if } \pi_1 + \pi_2 \in (2r, 2p)
\end{aligned}
\tag{29}
\]

Proof. Under indexability, in any control interval $t$, $W^*$ such that $W + V_t^P(W, \pi_1, \pi_2) =
\pi_1 + \pi_2 + V_t^A(W, \pi_1, \pi_2)|_{W = W^*}$ is unique and $I(t, \pi_1, \pi_2) = W^*$. We now prove the first part
of the proposition. When $\pi_1 + \pi_2 \ge 2p$, at $W = \pi_1 + \pi_2$, $V_t^A(W, \pi_1, \pi_2) = V_t^P(W, \pi_1, \pi_2)$.
This follows from Lemma 8 since $W = \pi_1 + \pi_2 \ge 2p$. Thus $W + V_t^P(W, \pi_1, \pi_2) =
\pi_1 + \pi_2 + V_t^A(W, \pi_1, \pi_2)|_{W = \pi_1 + \pi_2 \ge 2p}$, leading to $I(t, \pi_1, \pi_2) = W^* = \pi_1 + \pi_2$ when $\pi_1 + \pi_2 \ge
2p$. Using a similar argument for $\pi_1 + \pi_2 \le 2r$, the first part of the proposition is
established. Consider the second part with $2r < \pi_1 + \pi_2 < 2p$. If $W^* \ge 2p$, from Lemma
8, $V_t^A(W, \pi_1, \pi_2) = V_t^P(W, \pi_1, \pi_2)$, leading to $W^* = \pi_1 + \pi_2$. But this contradicts the
condition $\pi_1 + \pi_2 < 2p$. Thus $W^* < 2p$. From Lemma 7, $V_t^A(W, \pi_1, \pi_2) - V_t^P(W, \pi_1, \pi_2) \ge
0$, $\forall W$. Thus $W^* \ge \pi_1 + \pi_2$. This establishes the proposition. □

Figure 4: Simulation suggesting indexability of the RMAB-v. For various values of
$(\pi_1, \pi_2)$, the function $W - (\pi_1 + \pi_2)$ is shown to intersect the corresponding function
$V_t^A(W, \pi_1, \pi_2) - V_t^P(W, \pi_1, \pi_2)$ only once. The index is given by the value of $W$ at the point of
intersection. A horizon length of 5 is used.

We now show that the index policy partially resembles the greedy policy.

Proposition 12. Under indexability, the index based multi-project scheduling policy has
the following partial implementation structure: When the ARQ feedback from both
users in the scheduled project is an ACK, reschedule the pair. If both are NACKs, switch
to another pair.

Proof. Consider a control interval $t$. Let the state of the $i$-th project be given by $(\pi_{1,i}, \pi_{2,i})$,
$i \in \{1, \ldots, N\}$. Let $a_t$ be the project scheduled in the control interval $t$. The state of
any project $i \ne a_t$ in the next control interval $t - 1$ is given by $(T(\pi_{1,i}), T(\pi_{2,i}))$. Since
$2r \le T(\pi_{1,i}) + T(\pi_{2,i}) \le 2p$, from Proposition 11, the index of project $i$ in control interval
$t - 1$ is bounded as follows: $2r \le I_i(t - 1, T(\pi_{1,i}), T(\pi_{2,i})) \le 2p$. The state of the project $a_t$
in the next control interval is in $\{(p,p), (p,r), (r,p), (r,r)\}$, depending upon the nature of the
ARQ feedback from the two users constituting the project. Using Proposition 11, upon
receiving ACK from both, the index of $a_t$ is $I_{a_t}(t - 1, p, p) = 2p \ge I_{i \ne a_t}(t - 1, T(\pi_{1,i}), T(\pi_{2,i}))$,
and upon receiving NACK from both, $I_{a_t}(t - 1, r, r) = 2r \le I_{i \ne a_t}(t - 1, T(\pi_{1,i}), T(\pi_{2,i}))$. This
establishes the partial structure of the index policy. □
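
The partial structure in Proposition 12 amounts to a three-way decision on the ARQ feedback of the scheduled pair. The Python sketch below encodes it. The function name and the returned labels are hypothetical, and the mixed-feedback branch only signals that the proposition does not settle the decision.

```python
def next_action(ack_user1, ack_user2):
    """Partial rule of Proposition 12 for the currently scheduled pair."""
    if ack_user1 and ack_user2:
        return "reschedule"        # index 2p: no other project has a larger index
    if not ack_user1 and not ack_user2:
        return "switch"            # index 2r: no other project has a smaller index
    return "compare indices"       # one ACK, one NACK: not covered by the proposition


if __name__ == "__main__":
    for feedback in [(True, True), (False, False), (True, False)]:
        print(feedback, next_action(*feedback))
```

Only the mixed-feedback case requires evaluating indices, so the index computation can be skipped in the other two cases.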

Claim 13. If, under indexability, in any control interval $t$, the sums of the belief values of
the users in the projects, $(\pi_{1,i} + \pi_{2,i})$, are sufficiently separated from each other, the index
policy is implemented as the greedy policy in that control interval.

It was observed from numerical simulations that, with $W$ as the independent variable,
$V_t^A(W, \pi_1, \pi_2) - V_t^P(W, \pi_1, \pi_2)$ is closely bounded for all states with equal $\pi_1 + \pi_2$. This led
to a closely bounded index value $I(t, \pi_1, \pi_2)$ for this family of states identified by $\pi_1 + \pi_2$.
In addition, the index value as a function of $\pi_1 + \pi_2$ was observed to have a generally
increasing pattern with $\pi_1 + \pi_2$. Together, from these observations, it appears that the
index value increases as $\pi_1 + \pi_2$ is increased in sufficiently large steps. This is illustrated
in Fig. 5 and Fig. 6. Also note that the greedy policy chooses the project with the highest
immediate reward, i.e., $\pi_{1,i} + \pi_{2,i}$. This forms the basis for Claim 13. Note that, when
every user is scheduled at least once in the past, the belief values of the users are outputs
of the functions $T^h(p)$ or $T^k(r)$. Since no two users in a cell are scheduled in the same
control interval and taking note of the structure of $T^h$, a natural separation of $\pi_i$ may be
guaranteed across users within each cell with the degree of separation being a function
of the channel statistics.
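
The separation argument can be made tangible by listing the belief values $T^h(p)$ and $T^k(r)$ reachable within a short horizon. The Python sketch below assumes $T(\pi) = \pi p + (1 - \pi) r$ with $p > r$ and uses the channel statistics quoted for Fig. 6. The function names are hypothetical.

```python
def belief_after_idle(start, k, p, r):
    """k-fold application of T(pi) = pi*p + (1-pi)*r, i.e. T^k(start)."""
    pi = start
    for _ in range(k):
        pi = pi * p + (1.0 - pi) * r
    return pi


def reachable_beliefs(horizon, p, r):
    """Sorted beliefs T^k(p) and T^k(r) for k = 0, ..., horizon - 1."""
    values = {belief_after_idle(s, k, p, r) for s in (p, r) for k in range(horizon)}
    return sorted(values)


if __name__ == "__main__":
    p, r = 0.9861, 0.2043            # channel statistics quoted for Fig. 6
    beliefs = reachable_beliefs(5, p, r)
    gaps = [b - a for a, b in zip(beliefs, beliefs[1:])]
    print(beliefs)
    print(min(gaps))                 # smallest separation between reachable beliefs
```

Since $T^k(x) - \pi^* = (p - r)^k (x - \pi^*)$ with $\pi^* = r / (1 - p + r)$, the iterates started from $p$ decrease toward $\pi^*$ and those started from $r$ increase toward it. The gaps between reachable beliefs therefore depend only on $p$, $r$ and the number of idle intervals.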

7 Conclusion 

We have addressed the problem of multiuser scheduling with partial channel informa- 
tion in a multi-cell environment. By formulating the scheduling problem jointly with 
the ARQ based channel learning process and the intercell interference mitigating cell 
breathing protocol, we obtain optimal joint scheduling policies under various system 
constraints. We posed the original, unconstrained scheduling problem as a generalized 
variant of the Restless Multiarmed Bandit processes and introduced the notion of
indexability relevant to these processes. We conjectured, with numerical support, that
the multicell multiuser scheduling problem is indexable and partially characterized the
index policy. Ongoing work focuses on obtaining theoretical support for our conjecture
on indexability, in addition to a complete characterization, including performance study,
of the index policy.

Figure 5: Simulation suggesting index as a monotonic function of $\pi_1 + \pi_2$. $p =
0.4809$, $r = 0.3294$, horizon size of 5 is used. Intersections of functions $W - (\pi_1 + \pi_2)$ and
$V_t^A(W, \pi_1, \pi_2) - V_t^P(W, \pi_1, \pi_2)$ are labeled by $\pi_1 + \pi_2$.

Figure 6: Simulation showing scattering of the index value within the family of states with
equal $\pi_1 + \pi_2$. With sufficient spacing between the families ($\pi_1 + \pi_2$), the monotonicity of
index with $\pi_1 + \pi_2$ appears unaffected. $p = 0.9861$, $r = 0.2043$, horizon size of 5 is used.
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